Throughout college, every Tuesday night I could, I would go to a bar in downtown Gainesville with a group of friends to play trivia. One night at trivia, we started talking about what makes a good trivia team. We decided that certain topics always pop up in trivia, and that a good team has one or two people who know a lot about each of them.
As you can imagine, immediately after having this conversation we began arguing about which player was the most important. So I began thinking...
"What topics come up in trivia the most?"¶
As a trivia nerd, I've spent more than my fair share of time watching Jeopardy!. So when I found a dataset of all the questions and answers used across 15 years of the show's history, I knew what I had to do.
First things first, we need to bring the CSV in and clean it up a bit: standardize the column names, strip out the punctuation, and drop the pieces that aren't relevant to the project.
import pandas as pd
import matplotlib.pyplot as plt

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)

# Give the columns clean, consistent names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
                    'Question', 'Answer']
jeopardy.columns
Now that the columns are squared away, I'm going to strip out the punctuation to make things easier down the road.
import re
import nltk

# One-time downloads for the NLTK pieces we'll use below
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stopword lists

def removePunct(word):
    # Strip everything except letters, digits, and whitespace, then lowercase
    word = re.sub(r'[^A-Za-z0-9\s]', '', word)
    return word.lower()

jeopardy['clean_question'] = jeopardy['Question'].apply(removePunct)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(removePunct)
jeopardy.head(5)

# Keep only the two columns we need; .copy() sidesteps pandas'
# SettingWithCopyWarning when we add columns to this frame later
new = jeopardy[['clean_question', 'clean_answer']].copy()
new.head()
Now that everything is a bit cleaner, let's get to the NLP work. First we'll tokenize the strings and remove stop words to help with efficiency. I played around with stemming and lemmatizing the data, but decided it ended up hurting the end result more than it helped (there's a quick sketch of what that experiment looked like after the next code block).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stopword set once up front; calling stopwords.words('english')
# inside the loop would redo that lookup for every single token
stop_words = set(stopwords.words('english'))

def removeStop(para):
    words = word_tokenize(para)
    useful_words = [w for w in words if w not in stop_words]
    return ' '.join(useful_words)

new['final_question'] = new['clean_question'].apply(removeStop)
new['final_answer'] = new['clean_answer'].apply(removeStop)
new.head()
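For the curious, here's roughly what the stemming/lemmatizing experiment looked like. This is a sketch rather than the exact code I ran, using NLTK's PorterStemmer and WordNetLemmatizer (the wordnet corpus needs a one-time download):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stemWords(para):
    # Chop each token down to its stem, e.g. 'painting' -> 'paint'
    return ' '.join(stemmer.stem(w) for w in para.split())

def lemmatizeWords(para):
    # Map each token to its dictionary form, e.g. 'cities' -> 'city'
    return ' '.join(lemmatizer.lemmatize(w) for w in para.split())

Swapping either of these into the pipeline collapses related word forms into a single token, which is what ended up doing more harm than good for the phrase counts here.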
With the text prepped, we can run the questions through spaCy, pull out the noun chunks, and count up the most common ones.

import spacy
from collections import Counter

# Load spaCy's small English pipeline (older spaCy versions spelled this
# spacy.load('en'); current ones want the full model name)
nlp = spacy.load('en_core_web_sm')

all_ners = []
for ex in new['final_question']:
    doc = nlp(ex)
    # Grab every noun chunk, e.g. 'world war ii' or 'new york city'
    for chunk in doc.noun_chunks:
        all_ners.append(chunk.text)

len(all_ners)

c = Counter(all_ners)
c = c.most_common(200)
There's a bunch of clutter in there that got past the stopword wall the first time around. I'm just going to knock the obvious offenders off the top of the list, and then we'll run the same thing over the answers.
# Drop the junk entries that floated to the top, then keep the counts as a dict
c = c[9:]
c = dict(c)

# Same noun-chunk extraction, this time over the answers
all_ners = []
for ex in new['final_answer']:
    doc = nlp(ex)
    for chunk in doc.noun_chunks:
        all_ners.append(chunk.text)

c2 = Counter(all_ners)
c2 = c2.most_common(200)
c2 = dict(c2)
Now we can turn those frequency counts into word clouds: one for the questions, one for the answers.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def plotCloud(freqs):
    # Render a frequency dict as a word cloud
    wordcloud = WordCloud(width=1000, height=500).generate_from_frequencies(freqs)
    plt.figure(figsize=(15, 8))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
    plt.close()

plotCloud(c)   # most common noun chunks in the questions
plotCloud(c2)  # most common noun chunks in the answers
And that wraps up this analysis! There are plenty of other routes to go down with this dataset that could lead to some interesting findings. With the results of our noun-chunk extraction, we know that to have the best shot at Jeopardy!, you should focus on the topics that dominate the two word clouds above.
Of course, this approach is susceptible to problems like anything else. The analysis picks up the phrases that appear most often, not necessarily the topics that appear most often. We could use a knowledge base, a proper named-entity recognizer, or some other way of tracing phrases back to larger topics to build a better list of subjects.
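As one hedged sketch of that direction (not something from the analysis above), spaCy's built-in named-entity recognizer tags each entity with a coarse label like PERSON, GPE, or EVENT, which gives a rougher but more topic-like grouping than raw noun chunks. Note that NER leans on capitalization, so it works better on the original cased questions than on the cleaned lowercase ones:

from collections import Counter

# Reuses the `nlp` pipeline and `jeopardy` frame from earlier; nlp.pipe
# streams the documents through spaCy in batches, which is much faster
# than calling nlp() one question at a time
entity_counts = Counter()
for doc in nlp.pipe(jeopardy['Question'].astype(str)):
    for ent in doc.ents:
        # e.g. ('PERSON', 'Shakespeare') or ('GPE', 'France')
        entity_counts[(ent.label_, ent.text)] += 1

entity_counts.most_common(20)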
Let me know if you have any comments, questions or thoughts!
allison.kahn.12@gmail.com